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ABSTRACT 

This paper presents CUBEP 3 M, a high performance, publicly-available, cosmological N-body 
code and describes many utilities and extensions that have been added to the standard package, 
including a runtime halo finder, a non-Gaussian initial conditions generator, a tuneable accu- 
racy, and a system of unique particle identification. CUBEP 3 M is fast, has a memory imprint up 
to three times lower than other widely used N-body codes, and has been run on up to 20, 000 
cores, achieving close to ideal weak scaling even at this problem size. It is well suited and has 
already been used for a broad number of science applications that require either large samples 
of non-linear realizations or very large dark matter N-body simulations, including cosmolog- 
ical reionization, baryonic acoustic oscillations, weak lensing or non-Gaussian statistics. We 
discuss the structure, the accuracy, any known systematic effects, and the scaling performance 
of the code and its utilities, when applicable. 

Key words: N-body simulations — Large scale structure of Universe — Dark matter 



1 INTRODUCTION 

Many physical and astrophysical systems are subject to non-linear 
dynamics and rely on N-body simulations to describe the evolu- 
tion of bodies. One of the main field of application is the mod- 
elling of large scale structures, which are driven by the sole force of 
ravity. Recent obser vations of the cosmic micro wave background 
Kom atsu et alj2009ll201 ll). of galaxy clustering dYork et alj200ol : 
IColless et all2003l;ISchlegel et al.l2009l:lDrinkwater et all201oliof 
weak gravitational lensing dHevmans & CFHTLenS Collaboration! 
|2012|; Sheld on et al . 2009) and of supernovae redshift-distance re- 
lations all point towards a standard model of cosmology, in which 
dark energy and collision-less dark matter occupy more than 95 per 
cent of the total energy density of the universe. In such a paradigm, 
pure N-body codes are perfectly suited to describe the dynamics, as 
long as baryonic physics are not very important, or at least we un- 
derstand how the baryonic fluid feeds back on the dark matter struc- 
ture. The next generation of measurements aims at constraining the 
cosmological parameters at the per cent level, and the theoretical 
understanding of the non-linear dynamics that govern structure for- 
mation heavily relies on numerical simulations. 

For instance, a measurement of the baryonic acoustic oscil- 
lation (BAO) dilation scale can provide tight constraints on the 
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dark energy equation of sta t e feisenstein et alj200 5 ; Tegma rket al.l 
120061 : |Percivaletai]|200l ISchlegel et al.l l2009h . The most opti- 
mal estimates of the uncertainty requires the knowledge of the 
matter power spectrum covariance matrix, which is only accu- 
rate when measured from a l arge sample of N-body sim ula- 
tions fames & Hamiltonll2005l ; iTakahashi et"ai]|2009L l201ll) . For 
the same reasons, the most accurate estimates of weak gravi- 
tational lensing signals are obtained by propagating photons in 
past light cones that are extracted from simulated den sity fields 
dVale & Whitel2003l : ISato et alj2009ljHilbert et alj|2009l) . Another 
area where large-scale N-body simulations have been instrumental 
in recent years ar e in simulations of early cosmic s t ructures and 



reionization (e.g. Illiev et al J 120061 : IZahn et alj [20071 ; iTrac & Cenl 



l2007l ; llliev et alj2012h . The reionization process is primarily driven 
by low-mass galaxies, which for both observational and theoretical 
reasons, need to be resolved in fairly large volumes, which demands 
simulations with a large dynamical range. 



The basic problem that is addressed with N-body codes is 
the time evolution of an ensemble of N particles that is subject to 
gravitational attraction. The brute force calculation requires 0(N 2 ) 
operations, a cost that exceeds the memory and speed of cur- 
rent machines for large prob lems. Solving the problem on a mesh 
dHocknev & EastwooJll98ll) reduces to O(NlogN) the number of 
operations, as it is possible to solve for the particle-mesh (PM) in- 
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teraction with fast Fourier transforms techniques that rely o n high 
performance libraries such as FFTW (Frigo & Johnson 2005). 

With the advent of large computing facilities, parallel compu- 
tations have now become common practice, and N-body codes have 
evolved both in performance and complexi ty. Many have opted 
for 'tree' algo rithms , includin g GADGET iSpringel et ail 1200 ll ; 
ISpringelll2005h . PMT dxdl 19951) . GOTPM JBubinski et al.ll2004l) . or 
adaptive P 3 M codes like Hydra dCouchman et alj|l995l) . in which 
the local resolution increases with the density of the matter field. 
These often have the advantage to balance the work load across the 
computing units, which enable fast calculations even in high den- 
sity regions. The drawback is a significant loss in speed, which can 
be only partly recovered by turning off the tree algorithm. T he same 
reasoning applies to mesh-refined codes dCouchman]|l99lh . which 
in the end are not designed to perform fast PM calculations on large 
scales. 

Although such codes are needed to study systems that are spa- 
tially heterogeneous like individual clusters or haloes, AGNs or 
other compact objects, many applications are interested in study- 
ing large cosmological volumes in which the matter distribution is 
rather homogeneous. In such environments, the load balancing al- 
gorithm can be removed, and the effort can thereby be put towards 
speed, memory compactness and scalability. PMFAST (Me rz et all 
(2005), MPT hereafter) was one of the first code designed specif- 
ically to optimize the PM algorithm, both in terms of speed and 
memory usage, and uses a two-leve l mesh algorithm based on the 
gravity solver o f lTrac & Pen! J2003T) . The long range gravitational 
force is computed on a grid 4 3 times coarser, such as to mini- 
mize the MPI communication time and to fit in system's memory. 
The short range is computed locally on a finer mesh, and only 
the local sub-volume needs to be stored at a given time, allow- 
ing for OPENMP parallel computation. This setup enables the code 
to evolve large cosmological systems both rapidly and accurately, 
even on relatively modest clusters. One of the main advantages of 
PMFAST over other PM codes is that the number of large arrays is 
minimized, and the global MPI communications are cut down to the 
minimum: for passing particles at the beginning of each time step, 
and for computing the long range FFTs. 

Since its first published version, PMFAST has evolved in many 
aspects. The first major improvement was to transform the volume 
decomposition in multi-node configurations from slabs to cubes. 
The problem with slabs is that they do not scale well to large runs: 
as the number of cells per dimension increases, the thickness of 
each slab shrinks rapidly, until it reaches the hard limit of a single 
cell layer. With this enhancement, the code name was changed to 
CUBEPM. Soon after, it incorporated particle-particle (pp) interac- 
tions at the sub-grid level, and was finally renamed CUBEP 3 M. The 
public package now includes a significant number of new features: 
the pp force can be extended to an arbitrary range, the size of the 
redshift jumps can be constrained for improved accuracy during the 
first time steps, a runtime halo finder has been implemented, the ex- 
pansion has also been generalized to include a redshift dependent 
equation of state for dark energy, there is a system of unique parti- 
cle identification that can be switched on or off, and the initial con- 
dition generator has been generalized as to include non-Gaussian 
features. PMFAST was equipped with a multi-time stepping option 
that has not been tested on CUBEP 3 M yet, but which is, in principle 
at least, still available. 

The standard package also contains support for g as cosmo- 



logica l evolution through a portable TVD-MHD module ( P en et al 



2003) that scales up to thousands of cores as well (see lPang et al 



with the radiative transfer code C -ray iMellema et aT] [2006). 
CUBEP 3 M is therefore one of the most competitive and versa- 
tile public N-body code, and has been involved in a number of 
scientific applicat ions over the last few years, spanning the field 
of weak lensing dVafaei et al l 1201 C : iLu & Pen! 120081: iDore et all 



l2009tlLuetafl2Qlct 
BAO Zhang et al. 



Yu ct al 



2011 



2012 



Ngan et al 



Harnois-Deraps et al. 2012a 



20121 ; lHarnois-Deraps & Pert 
120121 : lHarnois-D eraps et al. l2012bl). fo rmation of early cos- 



mic s tructures ( iliev et al. 200 81. l2010f). obs ervations of dark 
stars fcackrisson et alj |201Ct lllie et alj l2012t) and reionization 



1 Iliev et al 



Mao ct al 



2012; 
20121 : 



j_ Fernandez et al.l \2 
Dattaetalj|20i 2b a) 



20101) and footnote 4 in section |4j, and a coupling interface 



20121 : iFriedrich et al.1 l201ll : 
Continuous efforts are be- 
ing made to develop, extend and improve the code and each of its 
utilities, and we expect that this will pave the way to an increasing 
number of science projects. Notably, the fact that the force calcula- 
tion is multi-layered makes the code extendible, and opens the pos- 
sibility to run CUBEP 3 M on hybrid CPU-GPU clusters. It is thus im- 
portant for the community to have access to a paper that describes 
the methodology, the accuracy and the performance of this public 
code. 

Since CUBEP 3 M is not new, the existing accuracy and sys- 
tematic tests were performed by different groups, on different ma- 
chines, and with different geometric and parallelization configura- 
tions. It is not an ideal situation in which to quantify the perfor- 
mance, and each test must therefore be viewed as a separate mea- 
surement that quantifies a specific aspect of the code. We have tried 
to keep to a minimum the number of such different test runs, and 
although the detailed numbers vary with the problem size and the 
machines, the general trends are rather universal. 

Tests on constraints of the redshift step size and on improve- 
ments of the force calculations were performed on the Sunnyvale 
beowulf cluster at the Canadian Institute for Theoretical Astro- 
physics (CITA). Each node contains 2 Quad Core Intel(R) Xeon(R) 
E5310 1.60GHz processors, 4GB of RAM, a 40 GB disk and 2 
gigE network interfaces. These tests were performed on the same 
cluster, but with different box sizes, starting redshifts and parti- 
cle numbers. Hereafter we refer to these simulation sets as the 
CITA configurations, and specify which parameters were used in 
each case. Some co mplimentary runs were also performed on the 
SciNet GPC cluster fcoken e t al. 2010), a system of IBM iDataPlex 
DX360M2 nodes equipped with Intel Xeon E5540 cores running at 
2.53GHz with 2GB of RAM per core. For tests of the code accu- 
racy, of the non-Gaussian initial conditions generator and of the 
run time halo finder algorithm, we used a third simulation configu- 
ration series that was run at the Texas Advanced Computing Centre 
(TACC) on Ranger, a SunBlade x6420 system with AMD x86.64 
Opteron Quad Core 2.3 GHz 'Barcelona' processors and Infiniband 
networking. These RANGER4000 simulations evolved 4000 3 par- 
ticles from z = 100 to z = with a box side of 3.2/r'Gpc on 4000 
cores. 

The paper is structured as follow: section[2]reviews the struc- 
ture and flow of the main code; section [3] describes how Poisson 
equation is solved on the two-mesh system; we then present in sec- 
tion [4] the scaling of the code to very large problems, including 
much larger runs that were produced at various high performance 
computing centres; section [5] discuss the accuracy and systematic 
effects of the code. We then describe in section|6]the run-time halo 
finder, in section [7] various extensions to the default configuration, 
and conclude afterwards. 
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2 REVIEW OF THE CODE STRUCTURE 

An optimal large scale N-body code must address many challenges: 
minimize the memory footprint to allow larger dynamical range, 
minimize the passing of information across computing nodes, re- 
duce and accelerate the memory accesses to the large scale arrays, 
make efficient use of high performance libraries to speed up stan- 
dard calculations like Fourier transforms, etc. In the realm of paral- 
lel programming, high efficiency can be assessed when a high load 
is balanced across all processors most of the time. In this section, 
we present the general strategies adopted to address these chal- 
lengefl We start with a walkthrough the code flow, and briefly 
discuss some specific sections that depart from standard N-body 
codes, while referring the reader to future sections for detailed dis- 
cussions on selected topics. 

As mentioned in the Introduction section, CUBEP 3 M is a FOR- 
TRAN90 N-body code that solves Poisson's equation on a two-level 
mesh, with sub-cell accuracy thanks to particle-particle interac- 
tions. The code has extensions that departs from this basic scheme, 
and we shall come back to these later, but for the moment, we adopt 
the standard configuration. The long range component of the grav- 
ity force is solved on the coarse grid, and is global in the sense that 
the calculations require knowledge about the full simulated vol- 
ume. The short range force and the particle-particle interactions are 
computed in parallel on a second level of cubical decomposition 
of the local volumes, the tiles. To make this possible, the fine grid 
arrays are constructed such as to support parallel memory access. 
In practice, this is done by adding an additional dimension to the 
relevant arrays, such that each CPU accesses a unique memory lo- 
cation. The force matching between the two meshes is performed 
by introducing a cutoff length, r c , in the definition of the two force 
kernels. The value of r c = 16 fine cells was found to balance the 
communication overhead between processes and the accuracy of 
the match between the two meshes. 

The computation of the short range force requires each tile to 
store the fine grid density of a region that includes a buffer surface 
around the physical volume it is assigned. The thickness of this sur- 
face must be larger than r c , and we find that a 24 cells deep buffer 
is a good compromise between memory usage and accuracy. This 
fully compensates for the coarse mesh calculations, whose CIC in- 
terpolation scheme reaches two coarse cells deep beyond the cutoff. 
When it comes to finding haloes at run time, this buffer can create 
a problem, because a large object located close to the boundary can 
have a radius larger than the buffer zone, in which case it would 
be truncated and be assigned a wrong mass, centre of mass, etc. 
It could then be desirable to increase the buffer zone around each 
tile, at the cost of a loss of memory dedicated to the actual physical 
volume, and the code is designed to allow for such a flexibility. 

Three important processes that speed up the code were already 
present in PMFAST: 1) access to particles is accelerated with the use 
of linked lists, 2) deletion of 'ghost' particles in buffer zones is done 
at the same time as particles are passed to adjacent nodes, and 3) 
the global FFTs are performed with a slab decomposition of the 
volumes via a special file transfer interface, designed specifically 
to preserve a high processor load. 

Because the coarse grid arrays require 4 3 times less memory 
per node, it does not contribute much to the total memory require- 
ment, and the bulk of the foot-print is concentrated in a handful of 

Many originate directly from MPT and were preserved in CUBEP 3 M; 
those will be briefly mentioned, and we shall refer the reader to the original 
PMFAST paper for greater details. 



fine grid arrays. Some of these are only required for intermediate 
steps of the calculations, hence it is possible to hide therein many 
coarse grid arrayj|. We present here the largest arrays used by the 
code: 

(i) xv stores the position and velocity of each particle 

(ii) 11 stores the linked-list that accelerate the access to particles 
in each coarse grid cell 

(iii) send_buf and recv.buf store the particles to be MP1- 
passed to the neighbouring sub-volumes 

(iv) rho_f and cmplx_rho_f store the local fine grid density in 
real and Fourier space respectively 

(v) force_f stores the force of gravity (short range only) along 
the three Cartesian directions 

(vi) kern_f stores the fine grid force kernel in the three direc- 
tions 

(vii) PID stores the unique particle identification tags as double 
integers. 

(viii) send_buf_PID and recv_buf_PID store the ID to be MPI- 
passed to the neighbouring sub- volumes 

The particle ID is a feature that can be switched off simply by 
removing a compilation flag, and allows to optimize the code for 
higher resolution configurations. 

In terms of memory, CUBEP 3 M can be configured such as the 
main footprint is dominated exclusively by the particle xv array and 
the linked list, in which case it asymptotes to about 28 bytes per par- 
ticles. For a fixed problem size, this can be achieved by maximizing 
the number of tiles, thereby reducing the size of the local fine mesh 
arrays. F or comparison, Lean-GADGET-2 code uses 84 bytes per 
particles dSpringelluOO !) . which is about three times more. There 
is a limit at which we can do such a volume breakdown, though, 
since the physical volume on each tile must be at least twice as 
large as the buffer, i.e. > 96 3 cells (see section|3]l. This allows a low 
rate of particle exchange across nodes; in the default configuration, 
the send_buf and recv_buf arrays are much larger, in which case 
the memory per particle goes up to 44 bytes. Enabling the parti- 
cle ID adds at least 8 bytes per particle from the main array, and 
as the MPI-communication buffers are allocated larger sizes, they 
contribute at most an additional 16 bytes as well. Hence a 'lean' 
version with the particle ID tags uses down to 36 bytes per par- 
ticles. In practice, minimizing the memory does not optimize the 
speed, as many CPU-intensive operations are performed on each 
tile - the FFTW for instance. A higher number of cells per tile - 
typically 176 3 - with fewer tiles usually runs faster, in which case 
the other arrays in the above-mentioned list become significantly 
larger, and the memory per particle is closer to 100 - 140 bytes. 

The code flow is presented in Fig. [T] and [2] Before entering 
the main loop, the code starts with an initialization stage, in which 
many declared variables are assigned default values, the redshift 
checkpoints are read, the FFTW plans are created, and the MPI com- 
municators are defined. The xv array is obtained from the out- 
put of the initial conditions generator (see section I7.lt . and the 
force kernels on both grids are constructed by reading precomputed 
kernels that are then adjusted to the specific simulation size. For 
clarity, all these operations are collected under the subroutine call 
initialize in Fig.[T] although they are actually distinct calls in 
the code. 

Each iteration of the main loop starts with the timestep sub- 

2 This memory recycling is done with 'equivalence' statements in FOR- 
TRAN^). 
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program cubep3m 
call initialize 
do 

call timestep 

call particle_mesh 

if (checkpoint_step) then 

call checkpoint 
elseif (last_step) 

exit 
endif 
enddo 

call finalize 
end program cubep3m 

Figure 1. Overall structure of the code simplified for readability. 



routine, which proceeds to a determination of the redshift jump by 
comparing the step size constraints from each force components 
and from the scale factor. The cosmic expansion is found by Tay- 
lor expanding Friedmann's equation up to the third order in the 
scale factor, and can accommodate constant or running equation 
of state of dark energy. The force of gravity is then solved in the 
particle_mesh subroutine, which first updates the positions and 
velocities of the dark matter particles, exchange with neighbour- 
ing nodes those that have exited to volume, creates a new linked 
list, then solve Poisson's equation. This subroutine is conceptually 
identical to that of PMFAST, with the exceptions that CUBEP 3 M de- 
composes the volume into cubes (as opposed to slabs), and com- 
putes pp interactions as well. The loop over tiles and the particle 
exchange are thus performed in three dimensions. When the short 
range and pp forces have been calculated on all tiles, the code exits 
the parallel OPENMP loop and proceeds to the long range calcu- 
lation. This section of the code is also parallelized on many oc- 
casions, but, unfortunately, the current MPI-FFTW libraries do not 
allow multi-threading. There is thus an inevitable loss of efficiency 
during each global Fourier transforms, during which only the sin- 
gle core MPI process is activ^. As in PMFAST, the particle po- 
sition and velocity updates are performed in a leap frog scheme 
l lHocknev & Eastwoorj|l98lh . 

If the current redshift corresponds to one of the checkpoints, 
the code advances all particles to their final location and writes 
them to file. These restart files begin with a small header that con- 
tains the local number of particles, the current redshift and time 
step, and the constraints on the time step jump from the previous it- 
eration; the header is followed by the position and velocity of each 
particle. The particle identification tags are written with a similar 
fashion in distinct files to simplify the post-processing coding. This 
general I/O strategy allows for highly efficient MPI parallel read 
and write, by default in binary format for compactness, and has 
been shown to scale well to large data sets. At this stage, the per- 
formance depends only on the setup and efficiency of the parallel 
file system. 

Similarly, the code can compute two-dimensional projections 
of the density field, halo catalogues (see section[6]for details), and 
can compute the power spectrum on the coarse grid at run time. The 
code exits the main loop when it has reached the final redshift, it 
then wraps up the FFTW plans and clears the MPI communicators. 



3 Other libraries such as P3DFFTlhttp: //code . google . com/p/p3dfft/l 
currently permit an extra level of parallelization, and it is our plan to migrate 
to one of these in the near future. 



subroutine particle_mesh 

call update_position + apply random offset 
call link_list 
call particle_pass 
! $omp parallel do 
do tile = 1, tiles_node 
call rho_f_ngp 
call cmplx_rho_f 
call kernel_multiply_f 
call force_f 
call update_velocity_f 
if(pp = .true.) then 
call link_list_pp 
call force_pp 
call update_velocity_pp 
if (extended_pp = .true.) then 
call link_list_pp_extended 
call force_pp_extended 
call update_velocity_pp_extended 
endif 
endif 
end do 

! $omp end parallel do 
call rho_c_ngp 
call cmplx_rho_c 
call kernel_multiply_c 
call force_c 
call update_velocity_c 
delete_buffers 
end subroutine particle_mesh 

Figure 2. Overall structure of the two-level mesh algorithm. We have in- 
cluded the section that concerns the standard pp and the extended pp force 
calculation, to illustrate that they follow similar linked-list logic. 

We have collected those operations under the subroutine finalize 
for concision. 

Other constraints that need to be considered is that any decom- 
position geometry limits in some ways the permissible grid sizes, 
and that volumes evenly decomposed into cubic sub-sections - such 
as CUBEP 3 M - require the number of MPI processes to be a perfect 
cube. Also, since the decomposition is volumetric, as opposed to 
density dependent, it suffers from load imbalance whenever the par- 
ticles are not evenly distributed. However, large scale cosmology 
problems are very weakly affected by this effect, which can gener- 
ally be overcome by allocating a small amount of extra memory per 
node. As mentioned in section|4] large runs can even be optimized 
by allocating less memory to start with, follow by a restart with a 
configuration that allows more memory per node, if required. 



3 POISSON SOLVER 

This section reviews how Poisson's equation is solved on a double- 
mesh configuration. Many parts of the algorithm are identical 
to PMFAST, hence we refer the reader to section 2 of MPT for 
more details. In CUBEP 3 M, the mass default assignment scheme 
are a 'cloud-in-cell' (CIC) interpolation for the coarse grid, 
and a 'nearest-grid-point' ( NGP) interpolation for the fine grid 
dHocknev & Eastwood[l98lh . This choice is motivated by the fact 
that the most straightforward way to implement a P 3 M algorithm 
on a mesh is to have exactly zero mesh force inside a grid, which is 
only true for the NGP interpolation. Although CIC generally has a 
smoother and more accurate force, the pp implementation enhances 
the code resolution by almost an order of magnitude. 
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The code units inherit from frrac & Perj|200j) and are sum- 
marized here for completeness. The comoving length of a fine grid 
cell is set to one, such that the unit length in simulation unit is 



l£ = a- 

N 



(1) 



where a is the scale factor, N is the total number of cells along one 
dimension, and L is the comoving volume in /r'Mpc. The mean co- 
moving mass density is also set to unity in simulation units, which, 
in physical units, corresponds to 



-3 -3 30 m //^* 

ID = p„,(0)a ~ = Cl„,p c a = 



(2) 



Q.„, is the matter density, H„ is the Hubble's constant, p c is the crit- 
ical density today, and G is Newton's constant. The mass unit is 
found with M. = OX 3 . Specifying the value of G on the grid fixes 
the time unit, and with G g ,-id = l/(6na), we get: 



IT = 



2a 2 



1 



(3) 



These choices completely determine the convertion between phys- 
ical and simulation units. For instance, the velocity units are given 
by W = LIT. 

The force of gravity on a mesh can be computed either with a 
gravitational potential kernel a>^(x) or a force kernel oj f (x). Gravity 
fields are curl-free, which allows us to relate the potential 0(x) to 
the source term via Poisson's equation: 

V 2 0(x) = AnGp(x) (4) 

We solve this equation in Fourier space, where we write: 

- (k) = -4nGp(k) s ^ (k) - (k) (5) 

The potential in real space is then obtained with an inverse Fourier 
transform, and the kernel becomes ai^(x) = -G/r, with r = |x|. 
Using the convolution theorem, we can write 

(p(x) = f p(x / )<^(x / - x)rfx' 

and 

F(x) = -mV0(x) 



(6) 



(7) 



Although this approach is fast, it involves a finite differentiation at 
the final step, which enhances the numerical noise. We therefore 
opt for a force kernel, which is more accurate, even though it re- 
quires four extra Fourier transforms. In this case, we must solve 
the convolution in three dimensions and define the force kernel co F 
such as: 



F(x) = 



|p(x< 



)<u F (x' - x)dx' 



(8) 



Because the gradient acting on Eq.|6]affects only unprime variables 
we can express the force kernel as a gradient of the potential kernel 
Namely: 

mGr 



o F (x) = -VtjJx) = -- 



(9) 



Following the spherically symmetric matching technique of 
MPT (section 2.1), we split the force kernel into two components, 
for the short and long range respectively, and match the overlapping 
region with a polynomial. Namely, we have: 



\to F (r)-p(r) ifr<r c 







otherwise 



(10) 



and 



w,(r) 



p{r) if r < r c 
(o F (r) otherwise 



(11) 



The vector /?(r) is related to the fourth order polynomial that is 
used in the potential case described in MPT by f) = — Var(r). The 
coefficients are found by matching the boundary conditions at r c up 
to the second derivative, and we get 
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(12) 



Since these calculations are performed on two grids of dif- 
ferent resolution, a sampling window function must be convoluted 
both with the density and the kernel (see [Eq. 7-8] of MPT). When 
matching the two force kernels, the long range force is always on 
the low side close to the cutoff region, whereas the short range 
force is uniformly scattered across the theoretical l/r 2 value - in- 
trinsic features of the CIC and NGP interpolation schemes respec- 
tively. By performing force measurements on two particles ran- 
domly placed in the volume, we identified a small region surround- 
ing the cutoff length in which we empirically adjust both kernels 
such as to improve the match. Namely, for 14 < r < 16 in fine cell 
units, u> s (r) — > 0.985<u,(r), and for 12 < r < 16, <u/(r) — » 1.2a>i(r). 
A recent adjustment of the similar kind was also applied on the 
fine kernel to compensate for a systematic underestimate of the 
force at fine grid distances. This is caused by the uneven balance of 
the force about the 1 /r 2 law in the NGP interpolation scheme. Al- 
though most figures in this paper were made prior this correction, 
the new code configuration applies a 90 per cent boost to the kernel 
for the six elements that are adjacent, i.e. a> s (r = 1) — * l.9(o s (r = 1) 
(see section [5/2] for more details). As mentioned in section [2] and 
summarized in Fig. [2] the force kernels are first read from files in 
the code initialization stage. Eq. [8]is then solved with fast Fourier 
transforms along each direction, and is applied onto particles in the 
update.velocity subroutine. 

The pp force is calculated during the fine mesh velocity up- 
date, which avoids loading the particle list twice and allows the 
operation to be threaded without significant additional work. Dur- 
ing this process, the particles within a given fine mesh tile are first 
read in via the linked list, then their velocity is updated with the 
fine mesh force component, according to their location within the 
tile. In order to organize the particle-particle interactions, we pro- 
ceed by constructing a set of threaded fine-cell linked list chains 
for each coarse cell. We then calculate the pairwise force between 
all particles that lie within the same fine mesh cell, excluding pairs 
whose separation is smaller than a softening length r so f, \ particles 
separated by less than this distance thus have their particle-particle 
force set to zero. As this proceeds, we accumulate the pp force ap- 
plied on each particle and then determine the maximum force ele- 
ment of the pp contribution, which is also taken into account when 
constraining the length of the global time step. 

Force softening is generally required by any code to prevent 
large scattering as r — > that can otherwise slow the calcula- 
tion down, and to reduce the two-body relaxation, which can affect 
the numerical convergence. Many force softening schemes can be 
found in the literature, including Plummer force, uniform or linear 
density profiles or the spline-softened model (see lDver & id ( fl993h 
for a review). In the current case, a sharp force cutoff corresponds 
to a particle interacting with a hollow shell. In comparison with 
other techniques, this force softening is the easiest to code and the 
fastest to execute. Generally, it is desirable to match the smooth- 
ness of the force to the order of the time integration. A Plummer 
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Figure 3. Dark matter power spectrum, measured at z = 0.5 in a series of 
SciNet256 simulations in which the softening length is varied. The simula- 
tions started off at z = 100, and have a box size of lOOft^'Mpc. 



force is infinitely differentiable, which is a sufficient but not neces- 
sary condition for our 2'"' order time integration. Also, one of the 
drawbacks of Plummer's softening is that the resolution degrades 
smoothly: the effects of the smoothing are present at all radii. In 
comparison, the uniform density and hollow shell alternatives both 
have the advantage that the deviations from 1/r 2 are minimized. Al- 
though all the results presented in this paper were obtained with the 
sharp cutoff softening, other schemes can easi ly be adopted as th ese 
are typically single line changes to the code. iDver & Id ( fl993h ar- 
gues in favour of the uniform density profile scheme - which is 
more physical - and future developments of CUBEP 3 M will incor- 
porate this option. 

The choice of softening length is motivated by a trade off be- 
tween accuracy and run time. Larger values reduce the structure 
formation but make the code run quicker. We show in Fig. [3] the 
impact of changing this parameter on the power spectrum. For this 
test, we produced a series of SciNet256 simulations, each starting 
at z = 100 and evolving to z = 0.5, with a box size of 100/r'Mpc. 
They each read the same set of initial conditions and random seeds, 
such that the only difference between them is the softening length. 
We record in Table Q] the real time for each trials, from where we 
witness the strong effect of the choice of r so f, on the computation 
time. In this test, which is purposefully probing rather deep in the 
non-linear regime, reducing the softening length to half its default 
value of l/10' /! of a fine grid cell doubles the run time, while the 
gain in power spectrum is less than two per cent. Similarly, increas- 
ing r so f, to 0.2 reduces the run time almost by a half, but suffers 
from a five per cent loss in power at the resolution turn-around. 
One tenth of grid cell seems to be the optimal choice in this trade 
off, however it should really be considered a free parameter. 

In addition, PMFAST could run with a different set of force 
kernels, described in MPT as 'least square matching', a technique 
which adjusts the kernels on a cell-by-cell basis such as to minimize 
the deviation with respect to Newtonian predictions. This was orig- 
inally computed in the case where both grids are obtained from CIC 
interpolation. Moving to a mix CIC/NGP scheme requires solv- 
ing the system of equations with the new configuration, a straight- 
forward operation. With the inclusion of the random shifting (see 
section [5721 . however, it is not clear how much improvement one 



Table 1. Scaling in CPU resources as a function of the softening length, in 
units of fine grid cells. 
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time (h) 
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0.3 
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0.2 


10.09 


0.1 


18.67 


0.05 


31.07 



would recover from this other kernel matching. It is certainly some- 
thing we will investigate and document in the near future. 

Finally, a choice must be done concerning the longest range of 
the coarse mesh force. Gravity can be either a) a 1/r 2 force as far 
as the volume allows, or b) modified to correctly match the period- 
icity of the boundary conditions. By default, the code is configured 
along to the second choice, which accurately models the growth of 
structures at very large scales. However, detailed studies of gravita- 
tional collapse of a single large object would benefit from the first 
setting, even though the code is not meant to evolve such systems 
as they generally require load balance control. 



4 SCALING PERFORMANCES 

The parallel algorithm of CUBEP 3 M is designed for 'weak' scal- 
ing, i.e. if the number of cores and the problem size increase in 
proportion to each other, then for ideal scaling the wall-clock time 
should remain the same. This is to be in contrasted with 'strong' 
scaling codes, whereby the same problem solved on more cores 
should take proportionately less wall-clock time. This weak scaling 
requirement is dictated by the problems we are typically investigat- 
ing (very large and computationally-intensive) and our goals, which 
are to address such large problems in the most efficient way, rather 
than for the least wall-clock time. Furthermore, we recall that there 
is no explicit load balancing feature, thus the code is maximally 
efficient when the sub-domains contain roughly an equal number 
of particles. This is true for most cosmological-size volumes that 
do not resolve too deep in the non-linear regime, but not for e.g. 
simulations of a single highly-resolved galaxy. 

We first recall that because of the volumetric decomposition, 
the total number of MPI processes needs to be a perfect cube. Also, 
for maximal resource usage, the number of tiles per node should 
be a multiple of the number of available CPUs per MPI process, 
such that no core sits idle in the threaded block. Given the avail- 
able freedom in the parallel configuration, as long as the load is 
balanced, it is generally good practice to maximize the number of 
OPENMP threads and minimize the number of MPI processes: the 
information exchange between cores that are part of the same moth- 
erboard is generally much faster. In addition, having fewer MPI 
processes reduces the total amount of buffer zones, freeing mem- 
ory that can be used to increase the mesh resolution. However, it 
has been observed that for the case of non-uniform memory access 
(NUMA) systems, the code is optimized when only the cores that 
share the same socket are OPENMP threaded. As one probes deeper 
into the non-linear regime however, the formation of dense objects 
can cause memory problems in such configurations, and increas- 
ing the number of MPI processes helps to ensure memory locality, 
especially in NUMA environments. 

The intermediary version of the code - CUBEPM - was ported 
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Figure 4. Scaling of CUBEP 3 M on Curie fat nodes (left) and on Ranger TACC facility for very large number of cores (right). Plotted is the code speedup 
("particles /' wallclock ) a g a i ns t core count, normalized by the smallest run in each case. The dashed line indicates the ideal weak scaling, and details on the data 
are listed in Table [2] 

Table 2. Scaling of CUBEP 3 M on Curie. Speedup is scaled to the smallest run. 
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Table 3. Scaling of CUBEP 3 M on Ranger. Speedup is scaled to the smallest run. 
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speedup 


ideal speedup 


absolute timing (min) 
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box size (h 1 Mpc) 
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258 


1728 3 
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4096 


4.53 


4.74 
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3072 3 


11.4 


10976 


9.78 


12.7 


845 


5488 3 


20 


21952 


14.73 


25.4 


561 


5488 3 


20 



to the IBM Blue Gene/L platform, and achieved weak-scaling up 
to 4096 processes (over a billion particles), with the N-body cal- 
culation incurring only a 10 per cent overhead at runtime (com- 
pared to 8 processes) for a balanced workloacJE In order to ac- 
commodate the limited amount of memory available per process- 
ing core on the Blue Gene/L platform machines, it was necessary 
to perform the long ran ge MPI FFT with a volumetric decomposition 
(Eleftheriou et al. 2005). Slab decomposition would have required 
a volume too large to fit in system memory given the constraints in 
the simulation geometry. 

The scaling of CUBEP 3 M was first tested with a dedicated 
series of simulations - the CURIE simulation suite- by increas- 
ing the size and number of cores on the 'fat' (i.e. large-memory) 
nodes of the Curie supercomputer at the Tres Grand Centre de 
Calcul (TGCC) in France. For appropriate direct comparison, all 
these simulations were performed using the same particle mass 



4 http : //web . archive . org/web/2006092 5132 146/ 
http: //www-03 . ibm.com/servers/deepcomputing/pdf/ 
Blue_Gene_Applications_Paper_CubePM_032306.pdf 



(Afpartide = 1-07 x 10" M Q ) and force resolution (softening length 
50 /r'kpc). The box sizes used range from 256 /j~'Mpc to 2048 
/r'Mpc, and the number of particles from 256 3 to 2048 3 . Simula- 
tions were run on 32 up to 2048 computing cores, also starting from 
redshift z = 100, and evolving until z = 0. Our results are shown in 
Fig.|4]and in Table[2] and present excellent scaling, within ~ 3 per 
cent of ideal, at least for up to 2048 cores. 

We have also ran CUBEP 3 M on a much larger number of cores, 
from 8000 to up to 21,976, with 5488 3 -6000 3 (165 to 216 billion) 
particles on Ranger and on JUROPA at the Jiilich Supercomputing 
Centre in Germany, which is an Intel Xeon X5570 (Nehalem-EP) 
quad-core 2.93 GHz system, also interconnected with Infiniband. 
Since it is not practical to perform dedicated scaling tests on such a 
large number of computing cores, we instead list in Table|3]the data 
directly extracted from production runs. We have found the code 
to scale within 1.5 per cent of ideal up to 4096 cores. For larger 
sizes (> 10,976 cores), the scaling is less ideal, due to increased 
communication costs, I/O overheads (a single time slice of 5488 3 
particles is 3.6 TB) and load balancing issues, but still within ~ 20 
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per cent of ideal. These first three Ranger runs were performed with 
4 MPI processes and 4 threads per Ranger node ('4way'jfl 

Furthermore, due to the increasing clustering of structures at 
those small scales, some of the cuboid sub-domains came to con- 
tain a number of particles well above the average, thereby requiring 
more memory per MPI process in order to run until the end. As a 
consequence, throughout most of their late evolution, the largest 
two of these simulations were run with 4096 and 21,952 cores and 
with only 2 MPI processes and 8 threads per node ('2way'), which 
on Ranger allows using up to 16 GB of RAM per MPI procesfl 
Because each processor accesses memory that is not fully local, 
this configuration does affect the performance somewhat, as does 
the imperfect load balancing that arises in such situations. This 
can be seen in the rightmost point of Fig. [4] (right panel), where 
the scaling is 42 per cent below ideal. We note that we still get 
~ 1.5 speedup from doubling the core count, even given these is- 
sues. Overall the code scaling performance is thus satisfactory even 
at extremely large number of cores. We expect the code to handle 
even larger problems efficiently, and is thus well suited to run on 
next generation Petascale systems. 

Finally, we note that several special fixes had to be developed 
by the TACC and JUROPA technical staff in order for our largest 
runs to work properly. In particular, we encountered unexpected 
problems from software libraries such as MPICH and FFTW when 
applied to calculations of such unprecedented size. 



5 ACCURACY AND SYSTEMATICS 

This section describes the systematics effects that are inherent to 
our P 3 M algorithm. We start with a demonstration of the accuracy 
with a power spectrum measurement of a RANGER4000 simula- 
tion. The halo mass function also assess the code capacity to model 
gravitational collapse, but depends on the particular halo finder 
used. We thus postpone the discussion on this aspect until section 
|6] and focus for the moment on the particles only. 

We then quantify the accuracy of the force calculation with 
comparisons to Newton's law of gravitation. Since most of the sys- 
tematics come from the contribution at smaller distances, we ded- 
icate a large part of this section to the force calculation at the grid 
scale. We then explain how the correction on the fine force kernel 
at r = 1 was obtained, and present how constraining the size of the 
redshift jumps can help improve the accuracy of the code. 

5.1 Density and power spectrum 

One of the most reliable ways to assess the simulation's accuracy 
at evolving particles is to measure the density power spectrum at 
late time, and compare to non-linear prediction. For an over-density 
field S(x), the power spectrum is extracted from the two point func- 
tion in Fourier space as: 

(\S(k)S(k')\) = {2nfP{k)S D {k' - k) (13) 

5 For these very large runs, we used a NUMA script taccMffinity, specially 
provided by the technical staff, that bind the memory usage to local sockets, 
thus ensuring memory affinity. This becomes important because the mem- 
ory sockets per node (32 GB RAM/node on Ranger) are actually not equal- 
access. Generally, the local memory of each processor has much shorter 
access time. 

6 In order to ensure local memory affinity, a second special NUMA control 
script, taccMffinitySway, was developed for us by the TACC technical staff 
and allowed to ran more efficiently in this mode. 



where the angle bracket corresponds to an ensemble (or volume) 
average. Fig. [5] presents a 2-dimensional density projection of a 
RANGER4000 simulation, which evolved 4000 3 particles until 
redshift zero. We then plot in Fig.|6]its dimensionless power spec- 
trum, defined as A 2 (k) = k 3 P(k)/(2n 2 ) , and observe that the agree- 
ment with the non-linear prediction o f lSmifh etaljj2003l) is within 
five per cent up to k = l.OfcMpc -1 . The code exhibits a ~ 5 per cent 
over estimate compared to the theory for 0.2 < k < 0.6/7Mpc~', 
an effect generally caused by inaccuracies at the early time steps. 
It can be removed by starting the code at a later redshift, but then 
the code has less time to relax, which reduces the accuracy on other 
quantities like the halo mass functions or the four point functions. 
The drop of power at higher k is partly caused by the finite mesh 
resolution, partly from expected deviations about the predictions. 
The fluctuations at low-k come from the white noise impos ed in 
the initial conditions, and it was show in iNgan et alj j2012h and 
lHarnois-D6raps & Pen! d2012l) that samples of a few hundreds of 
realizations average to the correct value. 

5.2 Mesh force at grid distances 

The pairwise force test presented in MPT was carried by placing 
two particles at random locations on the grid, calculating the force 
between them, then iterating over different locations. This test is 
useful to quantify the accuracy on a cell-by-cell basis, but lacks the 
statistics that occur in a real time step calculation. The actual force 
of gravity in the P 3 M algorithm, as felt by a single particle during 
a single step, is presented in the top left panel of Fig. [7] This force 
versus distance plot was obtained from a CITA128 realization, and 
the calculation proceeds in two steps: 1- we compute the force on 
each particle in a given time step. 2- we remove a selected particle, 
thus creating a 'hole', compute the force again on all particles, and 
record on file the force difference (before and after the removal) as 
a function of the distance to the hole. 

Particles in the same fine cell as the hole follow the exact 1 /r 2 
curve. The scatter at distances of the order of the fine grid is caused 
by the NGP interpolation scheme: particles in adjacent fine cells 
can be actually very close, as seen in the upper left region of this 
plot, but still feel the same mesh force at grid cell distances. This 
creates a discrepancy up to an order of magnitude, in loss or gain, 
depending on the location of the pair with respect to the centre of 
the cell. As the separation approaches a tenth of the full box size or 
so, the force on the coarse mesh scatters significantly about New- 
ton's law due to periodic boundary conditions. As mentioned at the 
end of section[3] the longest range of force kernel can either model 
accurately Newtonian gravity, or the structure growth of the largest 
modes, but not both. For the sake of demonstrating the accuracy of 
the force calculation, we chose the first option in this section, but 
this is not how the code would normally be used. 

The top right panel of Fig. [JJ shows the fractional error on 
the force along the radial direction (top) and the fractional tangen- 
tial contribution (bottom), also calculated from a single time step. 
Again, particles that are in the same cell have zero fractional er- 
ror in the radial direction, and zero tangential force. Particles feel 
a large scatter from the neighbouring fine mesh cells, but as the 
distance increase beyond five fine cells, the fractional error drops 
down to about 20 per cent. The transverse fractional error peaks 
exactly at the grid distance, and is everywhere about 50 per cent 
smaller than the radial counterpart. 

Although these scatter plots show the contributions from in- 
dividual particles and cells, it is not clear whether the mean radial 
force felt is higher or lower than the predictions. For this purpose, 
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Figure 5. Detail of a dark matter density 3.2/i~'Gpc per side, projected over 15/2~'Mpc and measured at z = in a RANGER4000 simulation. In the online 
version, particles are shown in pink, haloes in blue, and the colour scale is linear. The full projection is 64 times larger, but we show here only a sub-volume, 
which provides more details on the structure. 



we rebin the results in 50 logarithmically spaced bins and compute 
the mean and standard deviation. We show the resulting measure- 
ment in the middle left panel of Fig.[7] where we observe that on av- 
erage, there is a 50 per cent fractional error in the range 0.4 < r < 2, 
but the numerical calculations are otherwise in excellent agreement 
with Newton's law. Since the transverse force is, by definition, a 
positive number, and since we know that it averages out to zero 
over many time steps, we plot only the scatter about the mean, rep- 
resented in the figure by the solid line. The transverse scatter is 
smaller everywhere for r < 10. 

At grid scales, the fractional error is larger than PMFAST, 
largely due to the fact that the fine mesh force is performed with 
an NGP interpolation scheme - as opposed to CIC. This prescrip- 
tion is responsible for the larger scatter about the theoretical value, 
but, as mentioned earlier, NGP interpolation is essential to our im- 
plementation of the pp part. At the same time, the biggest problem 
with the straightforward pp force calculation is that the results are 
anisotropic and depend on the location of the fine mesh with respect 
to the particles. As an example, consider two particles on either side 
of a grid cell boundary, experiencing their mutual gravity attraction 
via the fine mesh force with a discretized one-grid cell separation. 
If, instead, the mesh was shifted such that they were within the 



same cell, they would experience the much larger pp force. This is 
clearly seen in the top left panel of Fig. [7] where particles physi- 
cally close, but in different grid cells, feel a force up to an order 
of magnitude too small. This effect is especially pronounced at the 
early stages of the simulation where the density is more homoge- 
neous, and leads to mesh artefacts appearing in the density field. 

In order to minimize these two systematic effects - the large 
scatter and the anisotropy - we randomly shift the particle distribu- 
tion relative to the mesh by a small amount - up to 2 fine grid cells 
in magnitude - in each dimension and at each time step. This adds 
negligible computational overhead as it is applied during the par- 
ticle position update, and suppresses the mesh behaviour that oth- 
erwise grows over multiple time steps. It is possible to shift back 
the particles at the end of each time step, which prevents a random 
drift of the whole population, a necessary step if one needs to cor- 
relate the initial and final positions of the particles for instance, or 
for hybrid dark matter - MHD simulations. 

We ensure that, on average, this solution balances out the mesh 
feature, by tuning the force kernels such as to provide a force as 
evenly balanced as possible, both at the cutoff length (r c = 16 fine 
cells) and at grid cell distances (see discussion in section[3}- Both 
of these adjustments are performed from the 'hole' and the pairwise 
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Figure 6. Dark matter power spectrum, measured at z, = in a RANGER4000 simulation, compared to the non-linear predictions of HALOFIT (calculated 
with the online CAMB toolkit). The vertical line corresponds to the scale of the coarse mesh, while the dotted line represents the Poisson noise. The particle 
masses are of 5.68 X 10'°A/ Q . 



force tests mentioned above. The bottom panels of Fig.[7jshow the 
effect of averaging the force calculation over ten time steps; the 
random offset is applied on the particles (and on the hole) at the 
end of each force calculation. We observe that the agreement with 
Newton's law is now at the per cent level for r > 1.5. At smaller 
distances, we still observe a scatter, but it is about two times smaller 
than during a single time step for r > 1. The force at sub-grid 
distances is, on average, biased on the low side by 20-40 per cent, 
as caused by the discretization effect of the NGP. 

We note that this inevitable loss of force in the NGP scheme 
is one of the driving arguments to extend the pp calculation outside 
the fine mesh cell, since the scattering about the actual 1/r 2 law 
drops rapidly with the distance. As discussed in section 1731 this 
gain in accuracy comes at a computational price, but at least we 
have the option to run the code in a higher precision mode. 

We present in Fig.[8]the dramatic impact of removing the ran- 
dom offset in the otherwise default code configuration. This test 
was performed with a CITA256 simulation of very large box size, 
the output redshifts are very early (the upper most curve at z = 10 
was obtained after only 60 time steps), such that the agreement 
with linear theory should extend up to the resolution limit. Instead, 
we observe that the power spectrum grows completely wrong, due 
to the large scatter in the force from the fine mesh, and to the 
anisotropic nature of the P 3 M calculations mentioned above. When 
these effects are not averaged over, the errors directly add up at each 
time step, which explains why later times are worst. We recall that 
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Figure 8. Dark matter power spectrum, measured at z, = 180, 100, 20 and 
10, in a CITA256 realization that is 1000/r 1 Mpc per side. The dashed 
line represents the initial condition power spectrum, the dotted lines are the 
linear predictions, and the open circles the standard P 3 M configuration. The 
dots are obtained by simply removing the random offset that is normally 
applied at each time step. 



PMFAST did not have this problem since it used CIC interpolation 
on both meshes. 

As mentioned in section [3] most of the tests presented in this 
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Figure 7. (top left:) Gravity force in the P 3 M algorithm, versus distance, compared with the exact 1/r 2 law. Distance is in fine mesh cell units, and force in 
simulation units. This particular calculation was obtained in a CITA128 realization with a box size of lOO/T'Mpc, in a single time step, (top right.) Fractional 
error on the force in the radial direction (top) and fractional tangential contribution (bottom). In a full simulation run, the scatter averages over many time steps, 
thanks to the inclusion of a random offset that is imposed on each particle, as discussed in the main text, (middle row.) Top row organized in 50 logarithmic 
bins; for the tangential contribution, we plot only the scatter about the mean (solid line), (bottom row.) Same as middle row, but averaged over ten time steps. 



paper were obtained without the fine force kernel correction at 
r = 1. For completeness, we present here the motivation behind 
the recent improvement.Without the correction, the force of gravity 
looks like Fig. [9] where we observe that the fractional error in the 
radial direction is systematically on the low side, by up to 50 per 
cent, even after averaging over 10 time steps. This is caused by the 



fact that in the strong regime of a 1/r 2 law, the mean force from 
a cell of uniform density, as felt by a particle, is not equivalent to 
that of the same cell collapsed to its centre - the actual position 
of equivalence would be at a closer distance. Basically, the idea is 
to boost the small distance elements of the NGP kernel by a small 
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Figure 9. Fractional error in the gravity force, without the fine force kernel 
correction at r = 1. This particular calculation was obtained in a OTA 128 
realization with a box size of 150/i~'Mpc, averaged over ten time steps. 



amount, which thereby corrects for the loss of force observed in 

Fig.m 

We show in Fig.[lO]the impact on the power spectrum of three 
different correction trials: a) the force from the first layer of neigh- 
bours is boosted by 90 per cent, b) the force from the first two 
layers of neighbours is boosted by 60 per cent, and c) the force 
from two layers is boosted by 80 per cent. We observe that with 
the trial a), we are able to improve the power at turn around by a 
factor of 1.5, which almost doubles the resolution. This test was 
performed in a series of SciNet256 simulations with a box length 
of 150/r'Mpc evolved until z = 0. We see from the figure that in 
the original configuration, the departure from the non-linear pre- 
dictions occur at k ~ 3.0/iMpc -1 , whereas the first trial allows for 
a similar resolution down to k ~ 4.5frMpc~' . The two other correc- 
tion schemes are more aggressive, and although they recover power 
at even smaller scales, they exhibit a small overestimate compared 
to non-linear predictions in the range 2.0 < k < 4.0/iMpc~ 1 , and 
more testing should be done prior to using these corrections. We 
have also tried to increase even more the boost factor in the case 
a) but once again, the boost started to propagate to larger scales, an 
effect we need to avoid. We therefore conclude that the fine force 
kernel should be subject to a boost factor following 'trial a', i.e. 
(o„(r= l)-> 1.9w s (r= 1). 

It is thinkable that other corrections could outperform this one, 
hence the kernel correction factor should also be considered as an 
adjustable parameter: 'no correction' is considered as the conser- 
vative mode, which is accurate only up to the turn around observed 
in the dimensionless power spectrum, and which has been used in 
most of the tests presented in this paper. Since higher resolution can 
be achieved at no cost with the corrected fine force kernel, it has 
now been adopted in the default configuration; 'trial b' and 'trial 
c' should be considered as too aggressive. We note, again, that we 
have not exhausted the list of possible correction, and that future 
work in that direction might result in even better resolution gain. 



5.3 Constraining redshift jumps 

At early stages of the simulation, the density field is homogenous, 
causing the force of gravity to be rather weak everywhere. In that 
case, the size of the redshift jumps is controlled by a limit in the 
cosmological expansion. If the expansion jump is too large, the size 
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Figure 10. Dark matter power spectrum, measured at z = 0, in a SciNet256 
realization that is 150/i~' Mpc per side. The thick solid line represents pre- 
dictions from CAMB, the thin solid line is the default (i.e. without cor- 
rection) configuration of the code, and the three other lines represent three 
different corrections that are applied at grid distances to compensate for the 
systematic force underestimation seen in the bottom panels of Fig. [7] 



of the residual errors can become significant, and one can observe, 
for instance, a growth of structure that does not match the predic- 
tions of linear theory even at the largest scales. One therefore needs 
to choose a maximum step size. In CUBEP 3 M, this is controlled by 
fmax, which is the fractional step size, da/(a + da) and is set to 0.05 
by default. Generally, a simulation should start at a redshift high 
enough so that the initial dimensionless power spectrum is well 
under unity at all scales. This ensures that the Zel'dovich approx- 
imation holds at the per cent level at least. Otherwise, truncation 
error at the early time steps will be significant, causing a drop of 
accuracy. 

It is possible to reduce this effect, and thereby improve signif- 
icantly the accuracy of the code, by modifying the value of r max , at 
the cost of increasing the total number of time steps. Fig.llllshows 
a comparison of late time power spectra of a series of CITA256 re- 
alizations that originate from the same initial conditions, and used 
the same random seeds to control the fine mesh shifts (mentioned 
above): only the value of r max was modified between each run. We 
observe that the impact is mostly located in the non-linear regime, 
where decreasing the time step to 0.006 allows the simulation to 
recover about 30 per cent of dimensionless power at the turn over 
scale. This gain is greatly dependent on the choice of initial red- 
shift, the resolution and the box size, and ideally one would make 
test runs in order to optimize a given configuration. As expected, 
the CPU resources required to run these simulations increase rapidly 
as r max decreases, as seen in Table|4] In this test case, reducing fur- 
ther at 0.001 shows only a mild ~ 5 per cent improvement in ac- 
curacy over 0.006, but the increase in time is more than a factor of 
four. The default configuration of the code is set to r max = 0.05, but 
in the light of this recent test, we recommend reducing to the value 
0.01 or even 0.005. We also mention here that with a proper use of 
second order initial conditions, it is possible to start the simulations 
at much later redshifts without loosing much accuracy. 
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Figure 11. Dark matter power spectrum, measured at z = in a series 
of CITA256 realizations. The starting redshift was raised to z = 200 to 
enhance the systematic effect. The different curves show different values of 
i~max- The resources required to run these simulations increase rapidly as 
r max decreases, as seen in Table|4] 



Table 4. Scaling in CPU resources as a function of the value of r max . The 
tests were performed on the CITA Sunnyvale cluster, and general trends 
could vary slightly on other machines. The time tabulated on the right col- 
umn corresponds to the full run time of the CITA256 simulations, which 
evolved 128 3 particles from z = 200 down to z = 0. 



max 


time (h) 


0.1 


1.46 


0.06 


1.48 


0.01 


1.67 


0.006 


1.91 


0.003 


2.83 


0.002 


4.13 


0.001 


8.15 



6 RUNTIME HALO FINDER 

We have implemented a runtime halo finding procedure, which we 
have developed based on the spherical overdensity (SO) approach 
ltacev& Cold 1 19941) . In the interest of speed and efficiency, the 
halo catalogues are constructed on-the-fly at a pre-determined list 
of redshifts. The halo finding is massively-parallel and threaded 
based on the main CUBEP 3 M data structures discussed in section 
[2] The code first builds the fine-mesh density for each sub-domain 
using CIC or NGP interpolation. It then proceeds to search for all 
local density maxima above a certain threshold (typically set to a 
factor of 100 above mean density) within each local tile. It then 
uses parabolic interpolation on the density field to determine more 
precisely the location of the maximum within the densest cell, and 
records the peak position and value. 

Once the list of peak positions is generated, they are sorted 
from the highest to the lowest density. Each halo candidates is then 
inspected independently, starting with the highest peak. The grid 
mass is accumulated in spherical shells of fine grid cells surround- 
ing the maximum, until the mean density within the halo drops be- 
low a pre-defined overdensity cutoff (usually set to 178 in units 
of the mean, in accordance with the top-hat collapse model). As 
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Figure 12. Simulated halo multiplicity function, — jjj, based on a 
RANGER4000 simulation with 3.2/r'Gpc box and 4000 3 particles (solid, 
red in the online ve rsion). For reference we also show a widely-used fit by 
iTinkeret all [2008) (blue, dotted). The particle masses are of 5.68xl0 10 M o . 



we accumulate the mass, we remove it from the mesh, so that no 
element is double-counted. This method is thus inappropriate for 
finding sub-haloes since, within this framework, those are naturally 
incorporated in their host haloes. Because the haloes are found on a 
discrete grid, it is possible, especially for those with lower mass, to 
overshoot the target overdensity. To minimize this effect, we correct 
the halo mass and radius with an analytical de nsity profile. We use 
the Truncated Isother mal Sphere (TIS) profile (Shapi ro"et al J 19991 : 
llliev & ShapkohoOlh for overdensities below ~ 130, and a simple 
1 jr 1 law for lower overdensities. TIS yields a similar outer slope to 
the Navarro, Frenk and White (NFW iNavarro et al.|[l997h profile, 
but extends to lower overdensities and matches well the virializa- 
tion shock position given by the self-similar collapse solution of 
lBertschingeil l ll985l) . 

After the correct halo mass, radius and position are deter- 
mined, we find all particles that are within the halo radius. Their 
positions and velocities are used to calculate the halo centre-of- 
mass, bulk velocity, internal velocity dispersion and the three an- 
gular momentum components, all of which are then included in the 
final halo catalogues. We also calculate the total mass of all parti- 
cles within the halo radius, also listed in the halo data. This mass 
is very close, but typically slightly lower, than the halo mass calcu- 
lated based on the gridded density field. The centre-of-mass found 
this way closely follows that found from the peak location, which 
is based on the gridded mass distribution. 

A sample halo mass function produced from a RANGER4000 
simulation at z = is shown in Fig. [T2] We compa r e our result 
to the precise fit presented recently by Tinker etal] 1 2008]). Un- 
like m ost other widely-used fits like the one by ISheth & Tormei] 
(2002), which are based on friends-of-friends (FOF) halo finders, 
this one relies on the SO search algorithm, whos e masses are sys- 
tematically differe nt from the FOF masses (e.g. lReedetal]|2007l: 
Tinker et "aT] |2008h . making this fit a better base for comparison 
here. Results show excellent agreement, within ~ 10 per cent for 
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all haloes with masses corresponding to 1000 particles or more. 
Lower-mass haloes are somewhat under-counted compared to the 
Tinker et "all J2008h fit, by ~ 20 per cent for 400 particles and by 
~ 40 per cent for 50 particles. This is largely due to the grid-based 
nature of our SO halo finder, which misses some of the low-mass 
haloes. It was shown that using more sophisticated halo finders 
(available only through post-processing calculations due to their 
heavier memory footprint) it is possible to recover most of the ex- 
pected number count. 



7 BEYOND THE STANDARD CONFIGURATION 

Whereas all the preceding sections contain descriptions and discus- 
sions that apply to the standard configuration of the code, many ex- 
tensions have been recently developed in order to enlarge the range 
of applications of CUBEP 3 M; this section briefly describes the most 
important of these improvements. 

7.1 Initial conditions 

As mention in section [2] the code starts off by reading a set of ini- 
tial conditions. These are organized as a set of 6 x A' phase-space 
arrays - one per MPI process - where A' is the number of particles 
in the local volume. Each MPI sub-volume generates its own ran- 
dom seed, which is used to create a local noise map. The initial 
power spectrum is found by reading off a transfer function at red- 
shift z = that is generated from CAMB by default, and evolved 
to the desired initial redshift with the linear ACDM growth func- 
tion. The noise map is then Fourier transformed, each element is 
multiplied by the (square root of the) power spectrum, and the re- 
sult is brought back to real space. To ensure that the global volume 
is correct, a buffer layer in the gravitational potential is exchanged 
across each adjacent node. The particles are then moved from a uni- 
form cell-centred position to a displaced position with Zel'dovich 
approximatior0. 

The density field produced by this algorithm follows Gaus- 
sian statistics, and is well suited to describe many systems. To in- 
crease the range of applicability of this code, we have extended 
this Gaussian initial conditions generator to include non-Gaussian 
features of the 'local' form, O(x) = 0(x) + /nl</>(x) 2 + gNL0( x ) 3 > 
where </i( x) is the Gaussian c ontribution to the Bardeen potential 
<D(x) (see lBartoloetalJ ( l2004l) for a review). We adopted the CMB 
convention, in which C> is calculated immediately after the matter- 
radiation equality (and not at redshift z = as in the large scale 
structure convention). For consistency, 0(x) is normalized to the 
amplitude of scalar perturbations inferred by CMB measurements 
(A, » 2.2 x 10~ 9 ). The local transformation is performed before 
the inclusion of the matter transfer function, and the initial particle 
positions and velocities are finally computed from O(x) according 
to the Zel'dovich approximation, as in the original Gaussian initial 
condition generator. 

This code was tested by comparing simulations and theoret- 
ical predictions for the effect of local primordial non-Gaussianity 

7 We remind that for Zel'dovich approximation to hold in such cases, the 
simulations need to be started at very early redshifts. Consequently, the size 
of the first few redshift jumps in such simulations can become rather large, 
and therefore less accurate, which is why we must carefully constrain the 
size of the jumps, as discussed in section l53l Another solution to this issue 
is to use second order perturbation theory to generate the initial conditions, 
which is not implemented yet in the public package. 




Figure 13. Dark matter power spectrum, measured at z = in a vol- 
ume 3.2/i~'Gpc per side, from 4000 3 particles. The two curves represent 
two RANGER4000 realizations of the same initial power spectrum, one of 
which used Gaussian statistics (blue in the online version) and the other 
the non-Gaussian initial condition generator (red online). The two curves 
differ at the sub-per cent level, as seen in the top panel, and the one-loop 
perturbation theory calculations (aqua online) accurately describes the ratio 
between the two curves up to k ~ 0.4/iMpc in this case. 

on the halo mass function and matter power spectrum (Desjacques, 
Seljak & Iliev 2009). It has also been used to quantify the impact 
of loc al non-Gaussian initial conditions on the halo pow er spec- 
trum (Desjacques et al. 2009; Desjacques & Seljak 2010) and bis- 
pectrum dSefusatti et al.ll201oh , as well as the matter bispectrum 
dSefusatti et alj|201 II) . Fig.U3lshows the late time power spectrum 
of two RANGER4000 realizations that started off the same initial 
power spectrum, but one of which had non-Gaussian features set 
to /)vl = 50. We see that the difference between the two power 
spectra is at the sub-per cent level, and that the ratio of the two 
power spectra is well de s cribed with one lo op perturbation theory 
dScoccimarro et al.ll2004lTaruva et alj2008h . 

7.2 Particle identification tags 

A system of particle identification can be turned on, which basically 
allows to track each particle's trajectory between checkpoints. Such 
a tool is useful for a number of applications, from reconstruction 
of halo merging history to tracking individual particle trajectories. 
The particle tag system has been implemented as an array of double 
precision integers, PID, and assigns a unique integer to each parti- 
cle during the initialization stage. Since the array takes a significant 
amount of memory, we opted to store the tags in a separate object 
- as opposed to adding an extra dimension to the existing xv ar- 
ray, such that it can be turned off when memory becomes an issue. 
Also, it simplifies many post-processing applications that read the 
position-velocity only. The location of each tag on the PID array 
matches the location of the corresponding particle on the xv ar- 
ray, hence in practice it acts just like an extra dimension. The tags 
on the array change only when particles exit the local volume, in 
which case the tag is sent along the particle to adjacent nodes in 



the pass_particle subroutine; similarly, deleted particles result 
in deleted tags. As for the xv arrays, the PID arrays get written to 
file in parallel at each particle checkpoint. 

7.3 Extended range of the pp force 

As discussed in section [5"!2l one of the main sources of error in the 
calculation of the force occurs from the PM interaction at the small- 
est scales of the fine grid. The NGP approximation is the least accu- 
rate there, which causes a maximal scatter about the exact 1/r 2 law. 
A straightforward solution to minimize this error consists in ex- 
tending the pp force calculation outside a single cell, up to a range 
where the scatter is smaller. Although this inevitably reintroduces a 
number of operations that scales as N 2 , our goal is to add the flex- 
ibility to have a code that runs slower, but produces results with a 
higher precision. 

To allow this feature, we have to choose how far outside a cell 
we want the exact pp force to be active. Since the force kernels 
on both meshes are organized in terms of grids, the simplest way 
to implement this feature is to shut down the mesh kernels in a 
region of specified size, and allow the pp force to extend therein. 
Concretely, these regions are constructed as cubic layers of fine 
mesh grids around a central cell; the freedom we have is to choose 
the number of such layers. 

To speed up the access to all particles within the domain of 
computation, we construct a thread safe linked list to be constructed 
and accessed in parallel by each core of the system, but this time 
with a head-of-chain that points to the first particle in the current 
fine mesh cell. We then loop over all fine grids, accessing the par- 
ticles contained therein and inside each fine grid cell for which 
we killed the mesh kernels, we compute the separation and the 
force between each pair and update their velocities simultaneously 
with Newton's third law. To avoid double counting, we loop only 
over the fine mesh neighbours that produce non-redundant contri- 
butions. Namely, for a central cell located at (jci,>i,Zi), we only 
consider the neighbours (x 2 ,y2,Z2) that satisfy one of the following 
conditions: 

• Z2 > Zl 

• Z2 = Z\ and y 2 > y\ 

• Z2 = Z\ and y 2 = y\ and X2 > Xj 

The case where all three coordinates are equal is already calcu- 
lated in the standard pp calculation of the code, hence we don't 
recount it here. To assess the improvement of the force calculation, 
we present in Fig[l4]a force versus distance plot analogous to Fig. 
[7] but this time the pp force has been extended to two layers of fine 
cells (again in a OTA 128 realization). We observe that the scatter 
about the theoretical curve has reduced significantly, down to the 
few percent level, and is still well balanced around the theoretical 
predictions. The fractional error on the radial and tangential com- 
ponents of the force, as seen in the right panel, are now at least five 
times smaller than in the default P 3 M algorithm. When averaging 
over 10 time steps, we observe that improvement is relatively mild, 
showing that the calculations are already very accurate. 

To quantify the accuracy improvement versus computing time 
requirements, we perform the following test. We generate a set 
of initial conditions at a starting redshift of z = 50, with a box 
size equal to 150/j~'Mpc, and with 512 3 particles. We evolve these 
SciNet512 realizations with different ranges for the pp calcula- 
tion, and compare the resulting power spectra. For the results to 
be meaningful, we again need to use the same seeds for the random 
number generator, such that the only difference between different 
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Figure 15. Dimensionless power spectrum for varying ranges of the exact 
pp force. These SciNet512 realizations evolved from a unique set of initial 
conditions at a starting redshift of z = 100 until z = 1.0, with a box size 
equal to 150/i~'Mpc. The only difference between the runs is the ranges of 
the pp calculation. The inset shows details about the resolution turnaround, 
and the thick vertical line corresponds to the coarse mesh scale. 



runs is the range of the pp force. Fig. 1 151 shows the dimensionless 
power spectrum of the different runs. We first see a significant gain 
in resolution when extending PM to P 3 M; gains roughly as strong 
are found when adding successively one and two layers of fine cell 
mesh in which the pp force is extended. We have not plotted the 
results for higher numbers of layers, as the improvement becomes 
milder there while the runs take increasingly more time to com- 
plete: For this reason, it seems that a range of two layers suffices to 
reduce most of the undesired NGP scatter. 

Extending the pp calculation comes at a price, since the num- 
ber of operations scales as TV 2 in the sub-domain. This cost is best 
captured by the increase of real time required by a fixed number 
of dedicated CPUs to evolved the particles to the final redshift. For 
instance, in our SciNet512 simulations, the difference between the 
default configuration and Ni ayer = 1 is about a factor of 2.78 in real 
time, and about 6.5 for Ni ayer = 2. This number will change depend- 
ing on the problem at hand and on the machine, and we recommend 
to perform performance gain vs resource usage tests on smaller runs 
before running large scale simulations with the extended pp force. 

The power spectrum does not provide the complete story, and 
one of the most relevant ways to quantify the improvement of the 
calculations is to compare the halo mass function from these dif- 
ferent SciNet512 runs. Fig. [T6] presents this comparison at redshift 
z = 1.0. About 76,000 haloes were found in the end, yielding a 
halo number density of about 0.0225 per [Mpc//?] 3 . We observe that 
the simulation undershoot the Sheth-Tormen predictions, which is 
caused by the relatively low resolution of the configuration com- 
pared to the RANGER4000 run. We also observe that the pure 
PM code yields up to 10 per cent less haloes than the P 3 M ver- 
sion over most of the mass range, whereas the extended pp algo- 
rithm generally recovers up to 10 per cent more haloes in the range 
10" - 10 13 M o . The difference in performance between an extended 
pp range of two vs four cells deep is rather mild, from where we 
conclude that Ni ay< , rs = 2 is the most optimal choice. 
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Figure 14. (top left:) Gravity force in the P 3 M algorithm, compared with the exact 1/r 2 law, in the same CITA128 realization as that shown in Fig. [7] except 
that the exact pp force has been extended to two fine mesh layers around each particle in the force test code. Particles in that range follow the exact curve, after 
which we observe a scatter at distances of the order of 2 fine grid that is much smaller than that observed in Fig. [7J The NGP interpolation scheme is again 
responsible for the scatter, but the effect is suppressed for increasing distances, (top right:) Fractional error on the force in the radial direction (points), over 
plotted with the scatter in the fractional tangential contribution (solid line). This was also obtained over a single time step, (bottom row.) Same as top row, but 
averaged over ten time steps. 



8 CONCLUSION 



documentation about the structure, compiling and running strategy 
is can be found on the CITA wiki pagqS 



This paper describes CUBEP 3 M, a public and massively parallel 
P 3 M N-body code that inherits from PMFAST and that now scales 
well to 20,000 cores, pushing the limits of the cosmological prob- 
lem size one can handle. This code is fast and has a memory foot- 
print up to three times smaller than Lean-GADGET-2. We summa- 
rize the code structure, review the double-mesh Poisson solver al- 
gorithm, and present scaling and systematic tests that have been 
performed. We also describe various utilities and extensions that 
come with the public release, including a run time halo finder, an 
extended pp force calculation, a system of particle ID and a non- 
Gaussian initial condition generator. CUBEP 3 M is one of the most 
competitive N-body code that is publicly available for cosmologists 
and astrophysicists, it has already been used for a large number 
of scientific applications, and it is our hope that the current docu- 
mentation will help the community in interpreting its outcome. The 
code is publicly available on github.com under cubep3m, and extra 
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Figure 16. (top:) Halo mass function for different ranges of the pp force 
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different curves and that of the default P 3 M configuration. 
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